The American Journal of Human Genetics
○ Elsevier BV
Preprints posted in the last 90 days, ranked by how well they match The American Journal of Human Genetics's content profile, based on 206 papers previously published here. The average preprint has a 0.20% match score for this journal, so anything above that is already an above-average fit.
Wang, H.; Wainschtein, P.; Sidorenko, J.; Fikere, M.; Zhang, Y.; Kemper, K. E.; Zheng, Z.; Hivert, V.; Zeng, J.; Goddard, M. E.; Visscher, P. M.; Yengo, L.
Assessing the contribution of ultra-rare variants (minor allele frequency <0.01%) to the heritability of complex traits remains challenging due to limited understanding of potential biases. Here, we focus on singletons (that is, variants observed only once in the study sample), the most abundant class of ultra-rare variants, to showcase various confounders of heritability estimates and underline pitfalls in their interpretation. We show through theory, simulations, and analysis of 5,330,210 exome-sequenced singletons in 305,813 unrelated European-ancestry individuals in the UK Biobank that (i) population stratification induces both upward and downward biases in singleton-based heritability estimates, (ii) estimates capture non-additive genetic effects, and (iii) asymptotic standard errors of estimates from likelihood-based procedures are generally mis-calibrated when traits are not normally distributed. We further showcase these biases in real-data analyses of 22 quantitative phenotypes and report, after accounting for these pitfalls, significant estimates for number of children (3.4%), peak expiratory flow (1.9%), red blood cell count (2.5%), white blood cell count (1.9%) and heel bone mineral density (2.4%). Overall, our study provides recommendations for robust inference of heritability from ultra-rare variants and underscores that reliable estimates for ordinal and binary traits will require far larger sample sizes and improved methods, given that confounding in these traits remains difficult to detect and correct.
Satterstrom, F. K.; Jodeiry, K.; Mahjani, B.; Hatem, G.; Park, S. J.; Klei, L.; Fu, J. M.; Wigdor, E. M.; the Autism Sequencing Consortium, ; Betancur, C.; Daly, M. J.; Roeder, K.; Devlin, B.; Buxbaum, J. D.; Cutler, D. J.
Autism spectrum disorder (ASD) is estimated to be up to four times as common in males as in females, yet the causes of this prevalence difference are not well established. One possible driver is genetic variation on the X chromosome, as it contains genes capable of contributing to ASD (e.g., PTCHD1, MECP2) and is known to play a role in genetic disorders with differential sex prevalence (e.g., color blindness). However, a lack of power compared to the autosomes, combined with the complexities of modeling its biology, has led to the X being largely overlooked in sequencing studies. Here, we develop quantitative X-linked TADA, a new model designed specifically for application to this chromosome, and use it to analyze rare variation from 50,663 individuals with ASD (and 136,670 individuals total). We find 9 genes on the X associated with ASD at a false discovery rate (FDR) < 0.05 and an additional 9 genes at FDR < 0.2, with many of these previously identified as involved in specific neurodevelopmental disorders. Point estimates of the liability conferred by de novo variants on the X are similar in females and males, with both sexes' estimates elevated >20% above the corresponding autosomal values. We also develop a general theory of how X-linked variation of any additive or non-additive effect influences liability and describe its implications for prevalence. Using this theory and our empirical results, we show how genetic variation on the X could contribute to the sex-differential prevalence of ASD.
Zheng, W.; Liu, T.; Xu, L.; Xie, Y.; Jing, Y.; Shao, H.; Zhao, H.
Phenome-wide association studies (PheWAS) enable systematic exploration of relationships between genetic variants and clinical phenotypes derived from electronic health records (EHRs). Conventional regression-based PheWAS treats phenotypes separately and relies on binary phenotype representations, which limits statistical power for rare variants and rare phenotypes and reduces the ability to detect associations with phenotypes that are distributed across clinical codes. To address this limitation, we first developed EmbedPheScan, a phenotype embedding-based prioritization framework that summarizes the phenotypic profiles of rare loss-of-function variant carriers in a continuous embedding space. We then proposed EA-PheWAS by combining these embedding-derived signals with conventional regression-based PheWAS results using the aggregated Cauchy association test. Using the UK Biobank whole-exome sequencing and EHR data, we show that the proposed methods maintain appropriate false-positive control. We then performed genome-wide phenome scans across all genes and across biologically defined gene classes to evaluate EA-PheWAS relative to conventional PheWAS and EmbedPheScan, consistently finding that EA-PheWAS outperformed the other two methods. We illustrate the utility of EA-PheWAS focusing on four genes representing distinct scenarios, including strong-effect disease genes (PKD1, PKD2), genes with large numbers of rare LoF carriers (NF1), and genes with extremely sparse carrier counts (FBN1).
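As a rough illustration of the combination step mentioned in this abstract, the following is a minimal sketch of the generic Cauchy combination formula that underlies the aggregated Cauchy association test. It is not the authors' EA-PheWAS implementation: the function name `cauchy_combine` and the equal default weights are assumptions made for the example, and the inputs are assumed to be p-values strictly between 0 and 1.

```python
import math


def cauchy_combine(p_values, weights=None):
    """Combine p-values with the Cauchy combination formula (hypothetical helper).

    Assumes every p-value lies strictly between 0 and 1.
    """
    if weights is None:
        # Assumption for this sketch: equal weights across the input tests.
        weights = [1.0] * len(p_values)
    total_weight = sum(weights)
    # Map each p-value to a standard Cauchy variate; small p gives a large value.
    stat = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    ) / total_weight
    # Upper-tail probability of the standard Cauchy distribution at `stat`.
    return 0.5 - math.atan(stat) / math.pi


# Example: an embedding-derived p-value combined with a regression p-value.
combined = cauchy_combine([0.01, 0.5])
```

With a single input the formula returns that p-value unchanged, and the combined value is driven mainly by the smallest p-values, which is why this kind of combination is often used to merge signals from complementary tests.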
Lin, J.-R.; Zhang, Z.
Mosaic loss of chromosome Y (LOY) is a common age-associated somatic alteration in men and is typically measured from DNA-based assays. Many cohorts, however, contain bulk RNA-seq data without matched DNA-based LOY measurements. We developed a Bayesian framework to estimate the fraction of cells with LOY from male bulk RNA-seq by modeling reduced Y-linked gene expression relative to expected expression after adjustment for age, expression covariates, and autosomal/X-linked control genes. In 377 male GTEx samples, individual Y-linked genes showed negative correlations with separately obtained DNA-based LOY measurements, supporting a shared Y-expression depletion signal. The primary fast empirical Bayes estimator achieved a Pearson correlation of 0.678 with measured LOY, a mean absolute error of 1.79%, a root mean squared error of 3.72%, and 95.2% empirical coverage of measured LOY. Performance was strongest for identifying large LOY events, with an AUC of 0.964 for measured LOY greater than 20%, while fine ranking among low-LOY samples remained uncertain. A mixture/PCA hierarchical Bayesian sensitivity model provided similar validation performance and interpretable posterior quantities but did not improve point estimation. Leave-one-Y-gene-out and prior-sensitivity analyses showed that the signal was distributed across multiple Y-linked transcripts and that prior shrinkage affected calibration. In an external whole-blood RNA-seq dataset without measured LOY, estimated LOY showed a modest age-related increase, but ex vivo immune stimulation shifted RNA-derived LOY estimates and reduced multiple Y-linked transcripts, indicating transcriptional confounding. These results show that bulk RNA-seq contains usable information about LOY, especially for larger events, but RNA-derived LOY should be interpreted as a probabilistic transcriptome-based estimate rather than a direct substitute for DNA-based mosaicism measurement.
Aguirre, M.; Irudayanathan, F. J.; Crow, M.; Hejase, H. A.; Menon, V. K.; Pendergrass, R. K.; McCarthy, M. I.; Fletez-Brant, K.
Machine learning-based annotation methods are increasingly used to assess the pathogenicity of genetic variants, but their performance at prioritizing variants for gene-level association testing remains poorly characterized. Here, we systematically benchmark five annotation methods (CADD v1.6, CADD v1.7, AlphaMissense, ESM-1b, and GPN-MSA) using four primary gene-based tests and six annotation-level aggregation tests across 14 quantitative traits measured in up to 350,377 UK Biobank participants. Using a novel framework based on Wasserstein distances, we quantify how annotation choice affects test calibration and power. Tests using CADD annotations achieve the highest signal separation, while tests using AlphaMissense annotations exhibit systematically lower calibration. All combinations of methods produced significant results that were enriched (1.8-5.8-fold) for loss-of-function intolerant genes, though tests using GPN-MSA annotations displayed the highest such enrichment. Replication across symmetric phenotypes and loss-of-function burden tests was generally similar across methods. Our analysis provides practical guidance for annotation method selection in rare variant studies and establishes a distributional framework for calibration assessment.
Wang, Y.; Tuftin, B.; Raffield, L. M.; Hidalgo, B.; Kerns, S. L.; DeWan, A. T.; Leal, S. M.; Auer, P.
Individuals with admixed ancestry comprise a significant proportion of populations of the Americas. Statistical methods have been developed to specifically leverage local ancestry inference to enhance the power and interpretability of genome-wide association studies in admixed populations. However, no such methods currently exist to test for rare-variant aggregate associations. Here we present LANTERN (Leveraging local ANcestry Tracts to Enhance Rare variaNt aggregate associations), a method that infers the alleles that lie on each ancestral haplotype and conducts rare-variant aggregate association testing in a generalized linear mixed model framework. Through simulation studies we demonstrated that LANTERN achieves proper control of Type 1 error while boosting power to detect associations when causal alleles predominately lie on one ancestral haplotype. Using data from a cohort of African American participants from the Jackson Heart Study, LANTERN identified two genes known to be involved in red-blood cell (RBC) biology when local ancestry information was incorporated. Specifically, a burden of rare alleles on European ancestral haplotypes in EPO was associated with both hemoglobin levels (HGB) and RBC counts, whereas a burden of rare alleles on African ancestral haplotypes in EPB42 was associated with HGB and RBC. In summary, LANTERN (i) allows for the identification of ancestry-specific rare-variant associations; and (ii) enhances rare-variant association signals compared to an analysis that ignores local ancestry. LANTERN is implemented in R and is freely available on GitHub.
Oubninte, S.; Ruczinski, I.; Yanek, L. R.; Mathias, R.; Bureau, A.
Few studies have assessed the performance of population-based phasing combined with parental genotypes to infer recombination on whole genome sequence (WGS) data. In this study, our objective was to evaluate whether Shapeit2 duoHMM, a Hidden Markov Model using parental genotypes, infers recombination events reliably on WGS and with narrower intervals than SNP arrays. We based our analysis on the overlap between recombination events inferred by Merlin on SNP genotypes and Shapeit2 on WGS and SNP genotypes. We used a sample of 61 extended families from the GeneSTAR study with TOPMed freeze 8 WGS on 580 sequenced subjects (60% of sample). Shapeit2 was run with a window size of 500 kilobases and 200 states on WGS. To mimic a SNP array, we extracted genotypes of 355,112 autosomal markers on the Illumina OmniExpress array. The number of recombination events per meiosis inferred by Shapeit2 on the WGS data (36.8) was consistent with the expected number over autosomes (35.7), whereas Merlin overestimated this number (115.0). 73% of Shapeit2 recombination events on WGS were detected by Merlin, a proportion rising to 91% when restricting to events also inferred by Shapeit2 on OmniExpress genotypes. Furthermore, Shapeit2 recombination intervals were narrower on WGS than on OmniExpress genotypes (median of 4,530 bp vs. 49,458 bp). This suggests that Shapeit2 on WGS is a reliable and accurate method for inferring recombination events.
Zhang, L.; Paterson, A. D.; Sun, L.
Testing for Hardy-Weinberg equilibrium (HWE) is a fundamental component of genetic data analysis, widely used for quality control and model validation. Although HWE testing is well established for autosomal loci, inference on the X chromosome is more complex due to sex-specific genotype structures and potential sex differences in minor allele frequency (sdMAF). Existing tests differ in their assumptions about sdMAF and male sample inclusion, often leading to distinct but poorly characterized null hypotheses. We develop a general statistical framework for HWE inference using the robust allele-based regression model. By formulating HWE testing as an assessment of allele-level dependence, the framework directly parameterizes Hardy-Weinberg disequilibrium, unifies existing Pearson χ²-based tests under explicit modeling assumptions, and clarifies their null hypotheses, degrees of freedom, and sensitivity to sdMAF. The framework also accommodates covariate and population-structure adjustment within a unified regression-based formulation. The proposed framework provides robust, interpretable, and flexible inference, establishing a unified statistical foundation for HWE testing across autosomal and X-chromosomal regions. Simulation studies and analysis of high-coverage 1000 Genomes Project data demonstrate that commonly used X-chromosome tests can exhibit inflated type I error or misleading inference when sdMAF is present.
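For orientation, the well-established autosomal case that this abstract contrasts with the X chromosome can be sketched as the classical one-degree-of-freedom Pearson chi-square test on genotype counts. This is only the textbook autosomal test for a biallelic locus, not the regression-based framework proposed in the preprint; the function name `hwe_chi_square` is hypothetical, and the sketch assumes the locus is polymorphic so that no expected count is zero.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Classical autosomal Pearson chi-square test for HWE (hypothetical helper).

    Takes genotype counts for AA, AB and BB at a polymorphic biallelic locus
    and returns the chi-square statistic and its 1-df p-value.
    """
    n = n_aa + n_ab + n_bb
    # Allele frequency of A estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    # Genotype counts expected under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Counts in exact Hardy-Weinberg proportions versus a complete heterozygote deficit.
balanced = hwe_chi_square(25, 50, 25)
deficit = hwe_chi_square(50, 0, 50)
```

The first call gives a statistic of zero because the counts match the expected proportions exactly, while the second gives a large statistic because no heterozygotes are observed.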
Dudek, M. F.; Wenz, B. M.; Voight, B. F.; Almasy, L.; Grant, S. F. A.
The vast majority of trait-associated loci discovered through genome-wide association studies (GWAS) are non-coding, yet most lack statistical alignment with any discovered expression quantitative trait loci (eQTLs). In particular, eQTLs are depleted at gene-distal regions and at "functionally important" genes - those with strong selective constraint and complex regulatory landscapes - likely due to selective depletion of high-effect variants. Here, we investigate the role of variants with weaker effects on expression transmitted through distal regulatory elements, which are detectable as chromatin accessibility QTLs (caQTLs). We aggregated caQTL data from ten studies derived across different tissues, cell types and cell lines, representing 104,024 lead caQTLs across 3,457 samples. We found that, across a range of gene properties, caQTLs are discovered at functionally important genes more often than eQTLs. These observations are consistent with a model in which many eQTLs and GWAS hits are mediated through genetic effects on regulatory elements, which may have weak or context-dependent effects on gene expression. Our results suggest that caQTL discovery is more sensitive than eQTL discovery in capturing the molecular consequences of GWAS hits, and can provide complementary information to eQTLs by implicating functional mechanisms of additional disease-associated loci.
Huang, Z.; Costantino, M.; Dahl, A.
Large-scale biobanks have enabled increasingly complicated genetic analyses across thousands of phenotypes. However, studies rarely consider the appropriate phenotype measurement scale, a problem that can drastically affect inferences on genetic architecture. Here, we introduce SIQReg, a practical solution to this classical problem, which learns a data-driven phenotype scale by minimizing heterogeneity across phenotype quantiles. Applied to complex traits in UK Biobank, SIQReg rejects the default scale for 24/25 traits. Generally, SIQReg scales lie between default and logarithmic, indicating that default-scale traits are neither purely additive nor purely multiplicative. We show that SIQReg improves both non-additive and additive genetic analyses. SIQReg eliminates most non-additive genetic signals (such as 97% of vQTL and 76% of quantile-dependent TWAS genes), indicating they may be statistical artifacts, while preserving biologically plausible non-additive signals. Simultaneously, SIQReg improves power to detect additive signals, increasing GWAS loci, TWAS genes, and PGS prediction accuracy by 11%, 13%, and 10%, respectively, and identifies 50% more high-risk individuals. These gains replicate across ancestry groups. Our results establish SIQReg as a principled approach to phenotype scale transformation that improves genetic analyses of complex traits.
Wang, J.; Morrison, J.
Mendelian randomization (MR) uses genetic variants as instrumental variables to infer causal relationships between complex traits. Standard MR can be used to estimate an average causal effect at the population level, and typically assumes a linear exposure-outcome relationship. Recently, several methods for estimating nonlinear effects have been developed. However, many have been found to produce spurious empirical findings when subjected to negative control analyses. We propose that this poor performance may be attributable to heterogeneity in variant-exposure associations. We demonstrate that heterogeneous genetic effects on exposure lead to biased estimates, poor coverage, and inflated type I error in control function and stratification-based methods. In contrast, two-stage least squares (TSLS) methods are robust to such heterogeneity, but suffer from low precision and low power in some circumstances. We show that a statistical test for heterogeneity can be used to guide the choice of nonlinear MR methods. Using UK Biobank data, we reassess the causal effects of BMI, vitamin D, and alcohol consumption on blood pressure, lipids, C-reactive protein, and age (negative control). We find strong evidence of heterogeneity for all three exposures, and also recapitulate previous results that control function and stratification-based methods are prone to false positives. Finally, using nonparametric TSLS, we identify evidence of nonlinear causal effects of BMI on HDL cholesterol, triglycerides, and C-reactive protein; however, specific estimates of the shape of these relationships are imprecise. Altogether, our results suggest that common nonlinear MR methods are unreliable in the presence of realistic levels of heterogeneity, and that more methodological development is required before practically useful nonlinear MR is feasible.
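To make the two-stage least squares idea concrete, here is a minimal sketch of the standard linear, single-instrument version. It is not the nonparametric TSLS used in the preprint: the helper names `ols_fit` and `two_stage_least_squares` and the toy data are assumptions made for illustration, with the confounder `u` constructed to be uncorrelated with the instrument `z`.

```python
def ols_fit(x, y):
    """Return (slope, intercept) of a simple least-squares fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_stage_least_squares(z, x, y):
    """Linear TSLS with one instrument z, exposure x and outcome y."""
    # Stage 1: predict the exposure from the instrument.
    slope_zx, intercept_zx = ols_fit(z, x)
    x_hat = [intercept_zx + slope_zx * zi for zi in z]
    # Stage 2: regress the outcome on the predicted exposure.
    causal_slope, _ = ols_fit(x_hat, y)
    return causal_slope


# Toy data (assumed): u confounds x and y but is uncorrelated with z.
z = [0.0, 0.0, 1.0, 1.0]
u = [1.0, -1.0, 1.0, -1.0]
x = [zi + ui for zi, ui in zip(z, u)]
y = [2.0 * xi + 3.0 * ui for xi, ui in zip(x, u)]  # true causal effect is 2

naive_slope, _ = ols_fit(x, y)             # confounded estimate
iv_slope = two_stage_least_squares(z, x, y)  # instrument-based estimate
```

In this toy example the naive regression of `y` on `x` is pulled away from the true effect by the confounder, while the two-stage estimate recovers it because only the instrument-driven part of the exposure enters the second stage.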
Manirakiza, A. V.; Baichoo, S.; Uwineza, A.; Dukundane, D.; Rugengamanzi, E.; Mutamuliza, J.; Niragira, A.; Muvunyi, R.; Besada, J.; Nielsen, S.; Bucknor, B.; Koeller, D. R.; Andrews, C.; Mutesa, L.; Fadelu, T.; Rebbeck, T. R.
Germline data from African populations remain sparse, limiting characterization of population-specific BRCA1/2 pathogenic variants. In a study of 175 Rwandan women with breast cancer, 7 unrelated carriers (4% of cases; 22% of pathogenic variant carriers) harbored the same BRCA1 frameshift variant, c.4065_4068del (p.Asn1355Lysfs*10), which is extremely rare in gnomAD yet recurrent in European, Asian, and Middle Eastern cohorts. Whole-exome sequencing and haplotype analysis of all 7 carriers revealed a shared ancestral block of approximately 581 kb surrounding the variant, and extended haplotype homozygosity and network analyses confirmed a common founder origin. Coalescent-based age estimation placed the founder event approximately 4,000-10,000 years ago. Comparison with 1000 Genomes Project data showed the founder haplotype is absent or exceedingly rare outside African and South Asian populations. These findings strongly support the c.4065_4068del variant as a prehistoric BRCA1 founder variant in Rwanda, with implications for targeted genetic testing, cascade screening, and cancer prevention in the region.
Liu, Z.; Ramteke, A.; Anand, A.; Gorla, A.; Jeong, M.; Sankararaman, S.
It is increasingly recognized that genetic effects on complex traits and diseases are shaped by environmental context. Biobanks that measure diverse environmental exposures alongside genotypes and phenotypes at scale enable systematic study of gene-environment (GxE) interactions. Existing approaches, however, are limited in their ability to accurately model polygenic GxE involving many exposures across genome-wide genetic variants. It is unclear which exposure combinations are relevant for a given trait, and true interactions must be distinguished from environment-dependent heteroskedastic noise. To address these challenges, we develop Efficient multi-eNvironmental Gene-environment Interaction iNference Estimator (ENGINE), a supervised variance-component framework that learns an embedding that combines multiple environmental exposures while jointly estimating additive, GxE, and heteroskedastic noise components. To enable biobank-scale inference, ENGINE makes a single pass over the genotype matrix to cache genotype-dependent summaries, then assembles normal-equation components and gradients at each iteration. In simulations, ENGINE controls type I error rates, achieves high power, and accurately recovers the environmental embedding while remaining efficient at biobank scale. Applied to five complex traits paired with lifestyle exposures in N = 291,273 unrelated white British individuals and M = 454,207 common SNPs (MAF > 0.01) from the UK Biobank, ENGINE recovered GxE variance that was on average 1.4-fold larger than that captured by a single exposure and 5.5-fold larger than that captured by the first principal component of the exposures.
Hnizda, A.; Martinez-Delgado, B.; Sanchez-Ponce, D.; Alonso, J.; Amiel, J.; Attie-Bitach, T.; Bada-Navarro, A.; Baladron, B.; Bermejo-Sanchez, E.; Brinsa, V.; Bukova, I.; Cazorla-Calleja, R.; Cervenkova, S.; Chow, S.; Dusek, P.; Fedosieieva, O.; Fernandez-Prieto, M.; Ghosh, S.; Gomez-Mariano, G.; Gregorova, A.; Hamilton, M. J.; Hartmannova, H.; Hernandez-San Miguel, E.; Herrero-Matesanz, M.; Hodanova, K.; Kadek, A.; Kerkhof, J.; Kleefstra, T.; Lacombe, D.; Levy, M. A.; Lopez-Martin, E.; Lyse, R.; Man, P.; Marin-Reina, P.; Macnamara, E. F.; McConkey, H.; Melenovska, P.; Mielu, L. M.; Moore, D.;
The EHMT1 and EHMT2 genes encode human euchromatin histone lysine methyltransferases 1 and 2 (EHMT1 alias GLP; EHMT2 alias G9a), which form heteromeric GLP/G9a complexes with essential roles in epigenetic regulation of gene expression. While EHMT1 haploinsufficiency has been established as the cause of Kleefstra syndrome 1, the pathogenesis of G9a dysfunction in human disease remains largely unknown. We identified seven de novo EHMT2 variants in patients with clinical presentation, episignatures, histone modifications and transcriptomic profiles similar to those of Kleefstra syndrome 1. In vitro studies revealed that these variants encode structurally stable G9a proteins that are catalytically incompetent due to aberrant interactions either with the histone H3 tail or with S-adenosylmethionine. Heterozygous mice carrying a patient-derived variant exhibited growth retardation, facial/skull dysmorphia and aberrant behavior. Here we report pathogenic EHMT2 variants that likely exert a dominant-negative effect on GLP/G9a complexes and thus genocopy EHMT1 haploinsufficiency via a distinct molecular mechanism, defining an autosomal dominant EHMT2-related Kleefstra syndrome.
Jacobsen, J. T.; Moller, P. L.; Rohde, P. D.
Genomics offers a powerful approach to identify causal mechanisms underlying coronary artery disease (CAD) risk, with implications for pathogenesis, personalized prevention strategies, and therapeutic target discovery. Functionality-informed genome-wide fine mapping was performed using the Bayesian framework SBayesRC to estimate genetic contributions of 6.9 million common variants, based on GWAS summary statistics from over one million individuals of European ancestry. Causal candidate genes were prioritized in a 5 kb flanking window within high-confidence local credible sets (LCSs). Their downstream biological influence was analyzed using protein-protein interaction networks and pathway enrichment analyses across three complementary dimensions: molecular, cellular, and disease level. Genetic modeling captured the highly polygenic architecture of CAD, estimating that on average 34,000 variants contribute to CAD risk, explaining 3.8% of total phenotypic variance. Thirty-six high-confidence variants (PIP > 0.9) collectively explained 13.6% of genetic variance, while most variants demonstrated small individual effects but substantial collective contributions. In total, 17,150 variants were prioritized within 581 high-confidence LCSs, of which 195 were annotated to genes and 170 were implicated in downstream pathway analyses. The three most influential variants were mapped to PHACTR1, APOE, and LPL, explaining 2.49%, 1.59%, and 1.46% of genetic variance, respectively. Pathway analyses revealed that genetic risk in CAD is driven by dysregulation of three interlinked biological processes: 1) lipoprotein function and cholesterol metabolism, 2) vascular homeostasis, and 3) cellular stress responses and inflammation. These findings advance the causal understanding of CAD pathogenesis, supporting the transition from association-based to functionality-informed genomic approaches in cardiovascular genetics.
Ravarani, C. N. J.; Arend, M.; Baukmann, H. A.; Cope, J. L.; Lamparter, M. R. J.; Sullivan, J. K.; Fudim, R.; Bender, A.; Malarstig, A.; Schmidt, M. F.
Human genetics has become a cornerstone of drug target discovery, yet the value of Mendelian randomization (MR) for predicting clinical success remains uncertain. Here, we systematically evaluated MR across 11,482 target-indication pairs with documented Phase II clinical outcomes to assess its utility for drug development. We find that MR statistical significance alone does not enrich for Phase II success, in contrast to genome-wide association study (GWAS) support, which confers an increase in success probability. However, this apparent limitation reflects the heterogeneous nature of clinical failure and the fact that MR encodes information beyond P values. When MR-derived features, including instrument strength and explained variance, are integrated into machine learning models, predictive performance improves substantially. An MR-informed XGBoost classifier identifies target-indication pairs with a 55% overall approval rate, corresponding to a 6.4-fold enrichment over unstratified programs and a 2.8-fold improvement over GWAS-supported targets in Phase II. Notably, this enrichment is achieved without reliance on statistically significant MR results. Our findings demonstrate that MR is most informative when treated as a graded, context-dependent source of causal evidence rather than a binary hypothesis test, and that its integration with machine learning enables scalable, genetics-informed prioritization of drug targets across the clinical pipeline.
Aziz, M. C.; Wilson, J.; Chow, C. Y.
PIGA-CDG is a congenital disorder of glycosylation caused by pathogenic partial loss-of-function variants in the PIGA gene. PIGA encodes an enzyme responsible for the catalytic transfer of N-acetylglucosamine to phosphatidylinositol during the first step of glycosylphosphatidylinositol anchor biosynthesis. Loss of this enzyme has a widespread phenotypic impact, but primarily results in neurological symptoms including seizures, intellectual disability, and developmental delay. Currently, treatments are limited and focus on symptom management. We developed an eye model of PIGA-CDG that has a reduced eye size. We screened a library of 98% 1,520 FDA/EMA-approved compounds to find drugs that improved the small eye phenotype. This screen revealed numerous drugs that improved eye size, including those that targeted dopamine signaling and cyclooxygenases. Using pharmacological and genetic approaches, we show that modulating dopamine signaling improves the eye size. Genetic inhibition of dopamine 2 receptor signaling and dopamine reuptake improve both the eye model and neurologically relevant PIGA-CDG phenotypes, including seizures and locomotor deficits. We also pharmacologically and genetically validate cyclooxygenase-targeting drugs in the eye model. These findings reveal novel biology underlying PIGA-CDG and point towards candidate therapeutic approaches.
AUTHOR SUMMARY: PIGA-CDG is a rare neurodevelopmental disorder caused by pathogenic variants in the gene PIGA. Patients primarily display neurological symptoms, including seizures, developmental delay, and intellectual disability. Fewer than 100 patients have been identified, and treatment strategies are limited. In the context of rare diseases, de novo drug development is difficult due to the high cost, lengthy development times, and a patient population that is often too small to conduct a clinical trial. Our lab leverages drug repurposing screening to circumvent many of the hurdles associated with de novo drug development. Here, we develop and screen FDA- or EMA-approved compounds on a Drosophila model of PIGA-CDG, uncovering novel biology underlying PIGA-associated pathophysiology. We use pharmacological and genetic tools to demonstrate that modifying dopamine signaling and abundance, as well as cyclooxygenase-mediated pathways, contribute to PIGA-associated phenotypes. This work highlights promising therapeutic targets for PIGA-CDG.
Chang, X.; Hou, S.; Zhou, X.
Calibrated prediction intervals for polygenic scores (PGS) are essential for communicating individual-level uncertainty in genomic medicine. We present updated comparisons of two methods for constructing such intervals: CalPred, a parametric approach, and PredInterval, a non-parametric approach. Our results show that both methods can achieve calibrated coverage, although CalPred additionally requires a sufficiently large calibration set. The two methods also exhibit complementary trade-offs with respect to dataset size and risk identification. We further show that contextual calibration, as introduced in Hou et al. and followed in Shi et al., is most naturally achieved through appropriate phenotype normalization and data preprocessing. Apparent miscalibration can arise from inadequate normalization or from providing contextual information to some methods but not others. In UK Biobank, standard GWAS phenotype normalization procedures are sufficient to achieve contextual calibration for traits analyzed. In the extreme simulations of Hou et al. and Shi et al., supplying contextual covariates to PredInterval restores contextual calibration without normalization, and appropriate normalization can achieve contextual calibration without supplying covariates, while also substantially improving upstream tasks including association power and PGS accuracy. Together, these results underscore the central role of phenotype normalization and data preprocessing in GWAS analyses, including reliable uncertainty quantification for PGS.
Motegi, T.; Huang, F.; Campbell, J. D.
Local ancestry inference (LAI) enables high-resolution characterization of chromosomal segments inherited from distinct ancestral populations, offering unique insights into genetic architecture in admixed cohorts. While LAI is commonly performed with high-coverage whole-genome sequencing (WGS), the performance of other genotyping assays or varying sequencing depths has not been thoroughly benchmarked. In this study, we systematically evaluated the accuracy of LAI across SNP microarrays, whole-exome sequencing (WES), and ultra low-pass WGS (ULP-WGS) using diverse validation samples and state-of-the-art imputation pipelines. We show that ULP-WGS, when paired with GLIMPSE2, achieves robust accuracy at 0.25x coverage with a minimum genome window size of 0.5 centimorgans, with mean accuracy minus one standard deviation exceeding 95%. For WES, using "on-target" reads alone yields suboptimal performance, particularly for European and South Asian ancestries with accuracy less than 79.1% and 70.6%, respectively. However, incorporating "off-target" reads in WES and utilizing GLIMPSE2 substantially improved accuracy to ≥95% with a minimum window size of 0.2 centimorgans. We further evaluated formalin-fixed, paraffin-embedded (FFPE) samples and found that LAI could be performed successfully using WES data with accuracies of ≥95% at a minimum window size of 0.5 centimorgans. In contrast, SNP microarrays did not achieve substantial accuracies at any window size (≤95%). Together, these results demonstrate that LAI is achievable without conventional high-coverage WGS and establish optimal parameters for LAI across platforms.
Abderrazzaq, H.; Singh, M.; Babb, L.; Bergquist, T.; Brenner, S. E.; Pejaver, V.; O'Donnell-Luria, A.; Radivojac, P.; ClinGen Computational Working Group, ; ClinGen Variant Classification Working Group,
Insertions and deletions (indels) represent a substantial source of genetic variation in humans and are associated with a diverse array of functional consequences. Despite their prevalence and clinical importance, indels, particularly short in-frame indels, remain critically understudied compared to single nucleotide variants and are challenging to interpret clinically. While many computational predictors for missense variants have been rigorously evaluated and calibrated for clinical use, the clinical utility of tools for in-frame indels remains uncertain. To address this gap, we have calibrated in-frame indel prediction tools for clinical variant classification. We constructed a high-confidence dataset of in-frame indel variants (≤50 bp) from clinical and population databases and estimated the prior probability of pathogenicity of a rare in-frame indel observed in a disease-associated gene, and of an insertion and deletion separately. Using a previously developed statistical framework based on local posterior probabilities, we then established score thresholds for eight computational tools, corresponding to distinct evidence levels for pathogenic and benign classification according to ACMG/AMP guidelines. All in-frame indel predictors evaluated here reached multiple evidence levels of pathogenicity and/or benignity, demonstrating measurable clinical value. However, these models consistently exhibited lower performance levels compared to missense predictors, highlighting the need for improved computational approaches for indel classification.